Abstract


Internet mail was originally limited to 7-bit ASCII. MIME added 
support for the use of 8-bit character sets in body parts, and also 
defined an encoded-word construct so other character sets could be 
used in certain header field values. However, full 
internationalization of electronic mail requires additional 
enhancements to allow the use of Unicode, including characters 
outside the ASCII repertoire, in mail addresses as well as direct use 
of Unicode in header fields like "From:", "To:", and "Subject:", 
without requiring the use of complex encoded-word constructs. This 
document specifies an enhancement to the Internet Message Format and 
to MIME that allows use of Unicode in mail addresses and most header 
field content. 


This specification updates Section 6.4 of RFC 2045 to eliminate the
restriction prohibiting the use of non-identity
content-transfer-encodings on subtypes of "message/".

Status of This Memo 


This is an Internet Standards Track document. 


This document is a product of the Internet Engineering Task Force
(IETF). It represents the consensus of the IETF community. It has
received public review and has been approved for publication by the
Internet Engineering Steering Group (IESG). Further information on
Internet Standards is available in Section 2 of RFC 5741.


Information about the current status of this document, any errata,
and how to provide feedback on it may be obtained at
http://www.rfc-editor.org/info/rfc6532.


Copyright (c) 2012 IETF Trust and the persons identified as the
document authors. All rights reserved.


This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.


1. Introduction


Internet mail distinguishes a message from its transport and further 
divides a message between a header and a body [RFC5322]. Internet 
mail header field values contain a variety of strings that are 
intended to be user-visible. The range of supported characters for 
these strings was originally limited to [ASCII] in 7-bit form. MIME 
[RFC2045] [RFC2046] [RFC2047] provides the ability to use additional 
character sets, but this support is limited to body part data and to 
special encoded-word constructs that were only allowed in a limited 
number of places in header field values. 


Globalization of the Internet requires support of the much larger set 
of characters provided by Unicode [RFC5198] in both mail addresses 
and most header field values. Additionally, complex encoding schemes 
like encoded-words introduce inefficiencies as well as significant 
opportunities for processing errors. And finally, native support for
the UTF-8 charset is now available on most systems. Hence, it is
strongly desirable for Internet mail to support UTF-8 [RFC3629]
directly.


This document specifies an enhancement to the Internet Message Format 
[RFC5322] and to MIME that permits the direct use of UTF-8, rather 
than only ASCII, in header field values, including mail addresses. A 
new media type, message/global, is defined for messages that use this 
extended format. This specification also lifts the MIME restriction 
on having non-identity content-transfer-encodings on any subtype of 
the message top-level type so that message/global parts can be safely 
transmitted across existing mail infrastructure. 


This specification is based on a model of native, end-to-end support 
for UTF-8, which depends on having an "8-bit-clean" environment 
assured by the transport system. Support for carriage across legacy, 
7-bit infrastructure and for processing by 7-bit receivers requires 
additional mechanisms that are not provided by these specifications. 


This specification is a revision of and replacement for [RFC5335]. 
Section 6 of [RFC6530] describes the change in approach between this 
specification and the previous version. 


2. Terminology Used in This Specification


A plain ASCII string is fully compatible with [RFC5321] and
[RFC5322]. In this document, non-ASCII strings are UTF-8 strings if
they are in header field values that contain at least one
<UTF8-non-ascii> (see Section 3.1).


Unless otherwise noted, all terms used here are defined in [RFC5321],
[RFC5322], [RFC6530], or [RFC6531].


The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119].


The term "8-bit" means octets are present in the data with values
above 0x7F.


3. Changes to Message Header Fields


To permit non-ASCII Unicode characters in field values, the header
definition in [RFC5322] is extended to support the new format. The
following sections specify the necessary changes to RFC 5322's ABNF.
The syntax rules not mentioned below remain defined as in [RFC5322].


Note that this protocol does not change rules in RFC 5322 for
defining header field names. The bodies of header fields are allowed
to contain Unicode characters, but the header field names themselves
must consist of ASCII characters only.


Also note that messages in this format require the use of the 
SMTPUTF8 extension [RFC6531] to be transferred via SMTP. 


3.1. UTF-8 Syntax and Normalization 


UTF-8 characters can be defined in terms of octets using the 
following ABNF [RFC5234], taken from [RFC3629]: 


UTF8-non-ascii = UTF8-2 / UTF8-3 / UTF8-4


UTF8-2 = <Defined in Section 4 of RFC3629> 
UTF8-3 = <Defined in Section 4 of RFC3629> 
UTF8-4 = <Defined in Section 4 of RFC3629> 
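

As an illustration only, the productions above can be transcribed
into a byte-level regular expression. The octet ranges below are
copied from Section 4 of [RFC3629]; the names UTF8_NON_ASCII,
contains_utf8_non_ascii, and is_valid_field_octets are hypothetical
helpers introduced for this sketch and are not defined by this
specification.

```python
import re

# UTF8-tail = %x80-BF, per Section 4 of RFC 3629.
_TAIL = rb"[\x80-\xBF]"

# UTF8-2 = %xC2-DF UTF8-tail
_UTF8_2 = rb"[\xC2-\xDF]" + _TAIL

# UTF8-3 excludes overlong forms (E0 80-9F) and surrogates (ED A0-BF).
_UTF8_3 = (
    rb"\xE0[\xA0-\xBF]" + _TAIL
    + rb"|[\xE1-\xEC]" + _TAIL + rb"{2}"
    + rb"|\xED[\x80-\x9F]" + _TAIL
    + rb"|[\xEE-\xEF]" + _TAIL + rb"{2}"
)

# UTF8-4 stops at U+10FFFF (F4 80-8F).
_UTF8_4 = (
    rb"\xF0[\x90-\xBF]" + _TAIL + rb"{2}"
    + rb"|[\xF1-\xF3]" + _TAIL + rb"{3}"
    + rb"|\xF4[\x80-\x8F]" + _TAIL + rb"{2}"
)

_ALTERNATIVES = rb"(?:" + _UTF8_2 + rb"|" + _UTF8_3 + rb"|" + _UTF8_4 + rb")"
UTF8_NON_ASCII = re.compile(_ALTERNATIVES)

# Zero or more ASCII octets or well-formed UTF8-non-ascii sequences.
_FIELD_OCTETS = re.compile(rb"(?:[\x00-\x7F]|" + _ALTERNATIVES + rb")*")


def contains_utf8_non_ascii(value: bytes) -> bool:
    """True if at least one well-formed UTF8-non-ascii sequence occurs."""
    return UTF8_NON_ASCII.search(value) is not None


def is_valid_field_octets(value: bytes) -> bool:
    """True if every octet is ASCII or part of a UTF8-non-ascii sequence."""
    return _FIELD_OCTETS.fullmatch(value) is not None
```

Under this sketch, a field value counts as a UTF-8 string in the
sense of Section 2 when contains_utf8_non_ascii returns True, while
is_valid_field_octets rejects overlong forms and encoded surrogates.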


See [RFC5198] for a discussion of Unicode normalization;
normalization form NFC [UNF] SHOULD be used. One of the most often
cited goals of internationalization is to permit people to spell
their names correctly. Since many mailbox local parts reflect
personal names, that principle applies to mailboxes as well. The
NFKC normalization form [UNF] SHOULD NOT be used because it may lose
information that is needed to correctly spell some names in some
unusual circumstances.
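

The difference between the two normalization forms can be seen with
a short sketch that uses the Python standard library's unicodedata
module. The function name normalize_for_header and the sample
strings are assumptions made for illustration.

```python
import unicodedata


def normalize_for_header(text: str) -> str:
    """Apply NFC, the form this section says SHOULD be used."""
    return unicodedata.normalize("NFC", text)


# 'e' followed by U+0301 COMBINING ACUTE ACCENT composes to U+00E9.
decomposed = "Jose\u0301"
composed = "Jos\u00e9"
nfc_result = normalize_for_header(decomposed)

# U+FB01 LATIN SMALL LIGATURE FI is a compatibility character.
# NFC leaves it alone; NFKC rewrites it to the two letters "fi",
# which is the kind of information loss the text warns about.
ligature = "\ufb01"
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)
```

Here nfc_result equals composed, nfc_ligature is unchanged, and
nfkc_ligature is the plain string "fi".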


3.2. Syntax Extensions to RFC 5322

The following rules extend the ABNF syntax defined in [RFC5322] and
[RFC5234] in order to allow UTF-8 content.

VCHAR =/ UTF8-non-ascii

ctext =/ UTF8-non-ascii

atext =/ UTF8-non-ascii

qtext =/ UTF8-non-ascii

text =/ UTF8-non-ascii
; note that this upgrades the body to UTF-8

dtext =/ UTF8-non-ascii

The preceding changes mean that the following constructs now allow
UTF-8:

1. Unstructured text, used in header fields like "Subject:" or 
"Content-description:". 

2. Any construct that uses atoms, including but not limited to 
local parts of addresses and Message-IDs. This includes 
addresses in the "for" clauses of "Received:" header fields. 

3. Quoted strings. 

4. Domains. 

Note that header field names are not on this list; these are still
restricted to ASCII.

3.3. Use of 8-bit UTF-8 in Message-IDs

Implementers of Message-ID generation algorithms MAY prefer to
restrain their output to ASCII since that has some advantages, such
as when constructing "In-reply-to:" and "References:" header fields
in mailing-list threads where some senders use internationalized
addresses and others do not.

3.4. Effects on Line Length Limits

Section 2.1.1 of [RFC5322] limits lines to 998 characters and
recommends that the lines be restricted to only 78 characters. This
specification changes the former limit to 998 octets. (Note that in
ASCII, octets and characters are effectively the same, but this is
not true in UTF-8.) The 78-character limit remains defined in terms
of characters, not octets, since it is intended to address display
width issues, not line-length issues.
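

A minimal sketch of how the two limits differ in practice follows.
The constant and function names are hypothetical, and counting
Unicode code points with len() is only an approximation of
"characters" for display purposes.

```python
from typing import Tuple

MAX_LINE_OCTETS = 998        # hard limit, measured in octets
RECOMMENDED_LINE_CHARS = 78  # soft limit, measured in characters


def check_line(line: str) -> Tuple[bool, bool]:
    """Check one line, excluding its CRLF, against both limits.

    Returns (within_hard_limit, within_recommended_limit).
    """
    octets = len(line.encode("utf-8"))  # what the 998 limit counts
    chars = len(line)                   # code points, an approximation
    return octets <= MAX_LINE_OCTETS, chars <= RECOMMENDED_LINE_CHARS


# 500 copies of U+00E9 are 500 characters but 1000 octets in UTF-8,
# so the line exceeds the hard limit despite being under 998 characters.
accented = check_line("\u00e9" * 500)
ascii_max = check_line("a" * 998)
short_accented = check_line("\u00e9" * 70)
```

With these inputs, accented is (False, False), ascii_max is
(True, False), and short_accented is (True, True).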


3.5. Changes to MIME Message Type Encoding Restrictions 


This specification updates Section 6.4 of [RFC2045]. [RFC2045]
prohibits applying a content-transfer-encoding to any subtypes of
"message/". This specification relaxes that rule -- it allows newly
defined MIME types to permit content-transfer-encoding, and it allows
content-transfer-encoding for message/global (see Section 3.7).


Background: Normally, transfer of message/global will be done in 
8-bit-clean channels, and body parts will have "identity" encodings, 
that is, no decoding is necessary. 


But in the case where a message containing a message/global is 
downgraded from 8-bit to 7-bit as described in [RFC6152], an encoding 
might have to be applied to the message. If the message travels 
multiple times between a 7-bit environment and an environment 
implementing these extensions, multiple levels of encoding may occur. 
This is expected to be rarely seen in practice, and the potential 
complexity of other ways of dealing with the issue is thought to be 
larger than the complexity of allowing nested encodings where 
necessary. 


3.6. Use of MIME Encoded-Words 


The MIME encoded-words facility [RFC2047] provides the ability to 
place non-ASCII text, but only in a subset of the places allowed by 
this extension. Additionally, encoded-words are substantially more 
complex since they allow the use of arbitrary charsets. Accordingly, 
encoded-words SHOULD NOT be used when generating header fields for 
messages employing this extension. Agents MAY, when incorporating 
material from another message, convert encoded-word use to direct use 
of UTF-8. 


Note that care must be taken when decoding encoded-words because the 
results after replacing an encoded-word with its decoded equivalent 
in UTF-8 may be syntactically invalid. Processors that elect to 
decode encoded-words MUST NOT generate syntactically invalid fields. 
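

To illustrate the generation side, the sketch below serializes the
same "Subject:" value twice with the Python standard library's email
package: once with the SMTPUTF8 policy, which writes UTF-8 directly,
and once with the SMTP policy, which falls back to an encoded-word.
The helper name serialize_subject and the sample subject are
assumptions made for this example.

```python
from email import policy
from email.message import EmailMessage


def serialize_subject(subject: str, utf8: bool) -> bytes:
    """Serialize a message holding only a Subject header field.

    With utf8=True the SMTPUTF8 policy is used, so non-ASCII text is
    emitted as raw UTF-8. With utf8=False the SMTP policy is used and
    non-ASCII text is wrapped in an RFC 2047 encoded-word.
    """
    chosen = policy.SMTPUTF8 if utf8 else policy.SMTP
    msg = EmailMessage(policy=chosen)
    msg["Subject"] = subject
    return msg.as_bytes()


direct = serialize_subject("Gr\u00fc\u00dfe", utf8=True)
legacy = serialize_subject("Gr\u00fc\u00dfe", utf8=False)
```

In this sketch, direct carries the UTF-8 octets of the subject and no
"=?" sequence, while legacy is pure ASCII and contains an
encoded-word.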


3.7. The message/global Media Type


Internationalized messages in this format MUST only be transmitted as 
authorized by [RFC6531] or within a non-SMTP environment that 
supports these messages. A message is a "message/global message" if: 


o it contains 8-bit UTF-8 header values as specified in this 
document, or 


o it contains 8-bit UTF-8 values in the header fields of body parts. 


The content of a message/global part is otherwise identical to that 
of a message/rfc822 part. 


If an object of this type is sent to a 7-bit-only system, it MUST 
have an appropriate content-transfer-encoding applied. (Note that a 
system compliant with MIME that doesn’t recognize message/global is 
supposed to treat it as "application/octet-stream" as described in 
Section 5.2.4 of [RFC2046].) 
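

The following sketch shows one way an agent might apply these rules
when embedding a message. It is a simplification: only the top-level
header block is inspected (header fields of nested body parts are not
examined), base64 is chosen arbitrarily as the 7-bit-safe encoding,
and the function names and the transport_is_8bit_clean flag are
hypothetical.

```python
import base64


def is_message_global(message: bytes) -> bool:
    """True if the top-level header block contains any 8-bit octet."""
    header_block = message.split(b"\r\n\r\n", 1)[0]
    return any(octet > 0x7F for octet in header_block)


def encode_message_global_part(message: bytes,
                               transport_is_8bit_clean: bool) -> bytes:
    """Wrap a message/global object as a MIME part.

    On an 8-bit-clean path the content is left unchanged ("8bit").
    Otherwise a content-transfer-encoding is applied, as required
    when the object is sent to a 7-bit-only system.
    """
    if transport_is_8bit_clean:
        cte = b"8bit"
        body = message
    else:
        cte = b"base64"
        # encodebytes emits LF line endings; mail uses CRLF.
        body = base64.encodebytes(message).replace(b"\n", b"\r\n")
    return b"".join([
        b"Content-Type: message/global\r\n",
        b"Content-Transfer-Encoding: ", cte, b"\r\n",
        b"\r\n",
        body,
    ])


embedded = "Subject: Gr\u00fc\u00dfe\r\n\r\nHallo\r\n".encode("utf-8")
part_7bit = encode_message_global_part(embedded, transport_is_8bit_clean=False)
part_8bit = encode_message_global_part(embedded, transport_is_8bit_clean=True)
```

Decoding the base64 body of part_7bit yields the original octets, so
the embedded message survives the 7-bit hop unchanged.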


The registration is as follows: 


Type name: message 
Subtype name: global 
Required parameters: none 
Optional parameters: none 


Encoding considerations: Any content-transfer-encoding is permitted. 
The 8-bit or binary content-transfer-encodings are recommended 
where permitted. 


Security considerations: See Section 4. 


Interoperability considerations: This media type provides 
functionality similar to the message/rfc822 content type for email 
messages with internationalized email headers. When there is a 
need to embed or return such content in another message, there is 
generally an option to use this media type and leave the content 
unchanged or down-convert the content to message/rfc822. Each of 
these choices will interoperate with the installed base, but with 
different properties. Systems unaware of internationalized 
headers will typically treat a message/global body part as an 
unknown attachment, while they will understand the structure of a 
message/rfc822. However, systems that understand message/global 
will provide functionality superior to the result of a
down-conversion to message/rfc822. The most interoperable choice
depends on the deployed software. 


Published specification: RFC 6532 

Applications that use this media type: SMTP servers and email
clients that support multipart/report generation or parsing.
Email clients that forward messages with internationalized headers
as attachments.


Additional information: 


Magic number(s): none 
File extension(s): The extension ".u8msg" is suggested. 
Macintosh file type code(s): A uniform type identifier (UTI) of
"public.utf8-email-message" is suggested. This conforms to
"public.message" and "public.composite-content", but does not
necessarily conform to "public.utf8-plain-text".


Person & email address to contact for further information: See the 
Authors’ Addresses section of this document. 


Intended usage: COMMON 


Restrictions on usage: This is a structured media type that embeds 
other MIME media types. An 8-bit or binary content-transfer- 
encoding SHOULD be used unless this media type is sent over a 
7-bit-only transport. 


Author: See the Authors’ Addresses section of this document. 
Change controller: IETF Standards Process


4. Security Considerations


Because UTF-8 often requires several octets to encode a single 
character, internationalization may cause header field values (in 
general) and mail addresses (in particular) to become longer. As 
specified in [RFC5322], each line of characters MUST be no more than 
998 octets, excluding the CRLF. On the other hand, MDA (Mail 
Delivery Agent) processes that parse, store, or handle email 
addresses or local parts must take extra care not to overflow 
buffers, truncate addresses, or exceed storage allotments. Also, 
they must take care, when comparing, to use the entire lengths of the 
addresses.


There are lots of ways to use UTF-8 to represent something equivalent
or similar to a particular displayed character or group of 
characters; see the security considerations in [RFC3629] for details 
on the problems this can cause. The normalization process described 
in Section 3.1 is recommended to minimize these issues. 


The security impact of UTF-8 headers on email signature systems such
as DomainKeys Identified Mail (DKIM), S/MIME, and OpenPGP is
discussed in Section 14 of [RFC6530].


If a user has a non-ASCII mailbox address and an ASCII mailbox 
address, a digital certificate that identifies that user might have 
both addresses in the identity. Having multiple email addresses as 
identities in a single certificate is already supported in PKIX 
(Public Key Infrastructure using X.509) [RFC5280] and OpenPGP 
[RFC3156], but there may be user-interface issues associated with the 
introduction of UTF-8 into addresses in this context. 


5. IANA Considerations 


IANA has updated the registration of the message/global MIME type 
using the registration form contained in Section 3.7. 